{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The top k approach\n",
    "\n",
    "The top-k method is a useful method to inspect a model's results. It simply consists of looking at the **most and least successful examples** to identify patterns within them. These patterns can then be used to engineer new features, or iterate on existing ones.\n",
    "\n",
    "First, we load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from pathlib import Path\n",
    "from sklearn.externals import joblib\n",
    "\n",
    "import sys\n",
    "sys.path.append(\"..\")\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from ml_editor.data_processing import (\n",
    "    format_raw_df, get_split_by_author, \n",
    "    add_text_features_to_df, \n",
    "    get_vectorized_series, \n",
    "    get_feature_vector_and_label\n",
    ")\n",
    "from ml_editor.model_evaluation import get_top_k\n",
    "\n",
    "data_path = Path('../data/writers.csv')\n",
    "df = pd.read_csv(data_path)\n",
    "df = format_raw_df(df.copy())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we add features and split the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = add_text_features_to_df(df.loc[df[\"is_question\"]].copy())\n",
    "train_df, test_df = get_split_by_author(df, test_size=0.2, random_state=40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load the trained model, and vectorize the features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_path = Path(\"../models/model_1.pkl\")\n",
    "clf = joblib.load(model_path) \n",
    "vectorizer_path = Path(\"../models/vectorizer_1.pkl\")\n",
    "vectorizer = joblib.load(vectorizer_path) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df[\"vectors\"] = get_vectorized_series(train_df[\"full_text\"].copy(), vectorizer)\n",
    "test_df[\"vectors\"] = get_vectorized_series(test_df[\"full_text\"].copy(), vectorizer)\n",
    "\n",
    "features = [\n",
    "                \"action_verb_full\",\n",
    "                \"question_mark_full\",\n",
    "                \"text_len\",\n",
    "                \"language_question\",\n",
    "            ]\n",
    "X_train, y_train = get_feature_vector_and_label(train_df, features)\n",
    "X_test, y_test = get_feature_vector_and_label(test_df, features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we'll use the top k method to look at:\n",
    "\n",
    "- The k best performing examples for each class (high and low scores)\n",
    "- The k worst performing examples for each class\n",
    "- The k most unsure examples, where our models prediction probability is close to .5\n",
    "\n",
    "To read more about how plotting these particular examples can help with model iteration, please refer to Chapter 5 of the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_analysis_df = test_df.copy()\n",
    "y_predicted_proba = clf.predict_proba(X_test)\n",
    "test_analysis_df[\"predicted_proba\"] = y_predicted_proba[:, 1]\n",
    "test_analysis_df[\"true_label\"] = y_test\n",
    "\n",
    "to_display = [\n",
    "    \"predicted_proba\",\n",
    "    \"true_label\",\n",
    "    \"Title\",\n",
    "    \"body_text\",\n",
    "    \"text_len\",\n",
    "    \"action_verb_full\",\n",
    "    \"question_mark_full\",\n",
    "    \"language_question\",\n",
    "]\n",
    "threshold = 0.5\n",
    "\n",
    "\n",
    "top_pos, top_neg, worst_pos, worst_neg, unsure = get_top_k(test_analysis_df, \"predicted_proba\", \"true_label\", k=2)\n",
    "pd.options.display.max_colwidth = 500"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most confident correct positive predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>predicted_proba</th>\n",
       "      <th>true_label</th>\n",
       "      <th>Title</th>\n",
       "      <th>body_text</th>\n",
       "      <th>text_len</th>\n",
       "      <th>action_verb_full</th>\n",
       "      <th>question_mark_full</th>\n",
       "      <th>language_question</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>30514</th>\n",
       "      <td>0.72</td>\n",
       "      <td>True</td>\n",
       "      <td>Classic fantasy races lazy or boring?</td>\n",
       "      <td>In my fantasy world there are other other sentient races besides humans. Some of them are \"original\", or at least not as common in mainstream stories. On the other hand, I have also included more traditional races like elves. I'm going to  continue using elves for my example. Obviously, I have tried to put a personal spin on the elves and not just copy-paste them from Tolkien or anywhere else.\\nSomewhere along the way, I realized that I could change them to be any other race I like while mai...</td>\n",
       "      <td>1785</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32835</th>\n",
       "      <td>0.70</td>\n",
       "      <td>True</td>\n",
       "      <td>Traits of Bad Writers - Analysing Popular Authors</td>\n",
       "      <td>I realise that this question can fall in the scope of personal opinion but I am looking for something concrete.\\nVery often, not only on this site but other writing related sites, I find people constantly say that certain popular authors are, in reality, bad writers. These include the likes of J.K. Rowling, Dan Brown, Christopher Paolini, etc. Reasons range from minor plot holes to being too verbose. However, that should not exactly be true or should rather be minor insignificant things as o...</td>\n",
       "      <td>1097</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       predicted_proba  true_label  \\\n",
       "Id                                   \n",
       "30514             0.72        True   \n",
       "32835             0.70        True   \n",
       "\n",
       "                                                   Title  \\\n",
       "Id                                                         \n",
       "30514              Classic fantasy races lazy or boring?   \n",
       "32835  Traits of Bad Writers - Analysing Popular Authors   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 body_text  \\\n",
       "Id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           \n",
       "30514  In my fantasy world there are other other sentient races besides humans. Some of them are \"original\", or at least not as common in mainstream stories. On the other hand, I have also included more traditional races like elves. I'm going to  continue using elves for my example. Obviously, I have tried to put a personal spin on the elves and not just copy-paste them from Tolkien or anywhere else.\\nSomewhere along the way, I realized that I could change them to be any other race I like while mai...   \n",
       "32835  I realise that this question can fall in the scope of personal opinion but I am looking for something concrete.\\nVery often, not only on this site but other writing related sites, I find people constantly say that certain popular authors are, in reality, bad writers. These include the likes of J.K. Rowling, Dan Brown, Christopher Paolini, etc. Reasons range from minor plot holes to being too verbose. However, that should not exactly be true or should rather be minor insignificant things as o...   \n",
       "\n",
       "       text_len  action_verb_full  question_mark_full  language_question  \n",
       "Id                                                                        \n",
       "30514      1785              True                True              False  \n",
       "32835      1097              True                True              False  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "top_pos[to_display]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most confident correct negative predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>predicted_proba</th>\n",
       "      <th>true_label</th>\n",
       "      <th>Title</th>\n",
       "      <th>body_text</th>\n",
       "      <th>text_len</th>\n",
       "      <th>action_verb_full</th>\n",
       "      <th>question_mark_full</th>\n",
       "      <th>language_question</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7488</th>\n",
       "      <td>0.08</td>\n",
       "      <td>False</td>\n",
       "      <td>Releases needed for picture books?</td>\n",
       "      <td>Do you need location releases for national parks and model releases for Pets to use in picture books?\\n</td>\n",
       "      <td>137</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16799</th>\n",
       "      <td>0.09</td>\n",
       "      <td>False</td>\n",
       "      <td>Capitalization of Open form Compound Words in Titles</td>\n",
       "      <td>What would be considered proper capitalization of open form compound words in titles?  Should the second part of the compound word be capitalized?  Why?\\nFor example, the capitalization for which title would be correct?\\n\\nCash flow Analysis Report\\n\\n--OR--\\n\\nCash Flow Analysis Report\\n\\nThanks!\\n</td>\n",
       "      <td>343</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       predicted_proba  true_label  \\\n",
       "Id                                   \n",
       "7488              0.08       False   \n",
       "16799             0.09       False   \n",
       "\n",
       "                                                      Title  \\\n",
       "Id                                                            \n",
       "7488                     Releases needed for picture books?   \n",
       "16799  Capitalization of Open form Compound Words in Titles   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                          body_text  \\\n",
       "Id                                                                                                                                                                                                                                                                                                                    \n",
       "7488                                                                                                                                                                                                        Do you need location releases for national parks and model releases for Pets to use in picture books?\\n   \n",
       "16799  What would be considered proper capitalization of open form compound words in titles?  Should the second part of the compound word be capitalized?  Why?\\nFor example, the capitalization for which title would be correct?\\n\\nCash flow Analysis Report\\n\\n--OR--\\n\\nCash Flow Analysis Report\\n\\nThanks!\\n   \n",
       "\n",
       "       text_len  action_verb_full  question_mark_full  language_question  \n",
       "Id                                                                        \n",
       "7488        137             False                True              False  \n",
       "16799       343              True                True               True  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "top_neg[to_display]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems most of the correct negative predictions have **short length**. This result reinforces the feature importance analysis which showed question length as one of the most important features.\n",
    "\n",
    "Let's look at the most confident incorrect negative predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>predicted_proba</th>\n",
       "      <th>true_label</th>\n",
       "      <th>Title</th>\n",
       "      <th>body_text</th>\n",
       "      <th>text_len</th>\n",
       "      <th>action_verb_full</th>\n",
       "      <th>question_mark_full</th>\n",
       "      <th>language_question</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>17543</th>\n",
       "      <td>0.15</td>\n",
       "      <td>True</td>\n",
       "      <td>Using Pronoun 'It' repetitvely for emphasis?</td>\n",
       "      <td>I'd like to know if using \"It\" repetitively (for emphasis) in this context is okay grammatically.\\n\\nTV has become the modern day baby sitter.  It is raising our children.  It is dictating the cultural narrative and shaping future society.  It is raising the bored inattentive child.  It is raising the consumer child. It is raising the aggressive child. It is raising the obese child. It is raising the misinformed and complacent child. It is raising the disenchanted child.  And what’s more, it...</td>\n",
       "      <td>591</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14157</th>\n",
       "      <td>0.17</td>\n",
       "      <td>True</td>\n",
       "      <td>Do you bold punctuation directly after bold text?</td>\n",
       "      <td>Do you bold punctuation directly after bold, linked or italic text? \\n</td>\n",
       "      <td>119</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       predicted_proba  true_label  \\\n",
       "Id                                   \n",
       "17543             0.15        True   \n",
       "14157             0.17        True   \n",
       "\n",
       "                                                   Title  \\\n",
       "Id                                                         \n",
       "17543       Using Pronoun 'It' repetitvely for emphasis?   \n",
       "14157  Do you bold punctuation directly after bold text?   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 body_text  \\\n",
       "Id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           \n",
       "17543  I'd like to know if using \"It\" repetitively (for emphasis) in this context is okay grammatically.\\n\\nTV has become the modern day baby sitter.  It is raising our children.  It is dictating the cultural narrative and shaping future society.  It is raising the bored inattentive child.  It is raising the consumer child. It is raising the aggressive child. It is raising the obese child. It is raising the misinformed and complacent child. It is raising the disenchanted child.  And what’s more, it...   \n",
       "14157                                                                                                                                                                                                                                                                                                                                                                                                                                               Do you bold punctuation directly after bold, linked or italic text? \\n   \n",
       "\n",
       "       text_len  action_verb_full  question_mark_full  language_question  \n",
       "Id                                                                        \n",
       "17543       591             False                True              False  \n",
       "14157       119             False                True              False  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "worst_pos[to_display]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the flipside, we find an overrepresentation of short questions with high scores in the examples our model got wrong.\n",
    "\n",
    "Next, let's look at the most confident incorrect positive predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>predicted_proba</th>\n",
       "      <th>true_label</th>\n",
       "      <th>Title</th>\n",
       "      <th>body_text</th>\n",
       "      <th>text_len</th>\n",
       "      <th>action_verb_full</th>\n",
       "      <th>question_mark_full</th>\n",
       "      <th>language_question</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7878</th>\n",
       "      <td>0.88</td>\n",
       "      <td>False</td>\n",
       "      <td>When quoting a person's informal speech, how much liberty do you have to make changes to what they say?</td>\n",
       "      <td>Even during a formal interview for a news article, people speak informally. They say \"uhm\", they cut off sentences half-way through, they interject phrases like \"you know?\", and they make innocent grammatical mistakes.\\nAs somebody who wants to fairly and accurately report the discussion that takes place in an interview, what guidelines should I use in making changes to what a person says?\\nWhile the simplest solution is to write exactly what they say and [sic] any errors they make, that can...</td>\n",
       "      <td>694</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37550</th>\n",
       "      <td>0.75</td>\n",
       "      <td>False</td>\n",
       "      <td>How to make sure that my reader has not forgotten an incident or character which was described earlier and referenced much later in the writing?</td>\n",
       "      <td>In the context of writing a fiction novel, how to make sure that readers remember several of the incidents or events or even characters that were described in earlier chapters, when the same gets referenced in the much later part? \\nI have also experienced this myself as a reader, especially in mystery novels where a minute detail or character described earlier ends up as a dramatic, concluding climax later. Reader gets a sense of being lost as they happen to forget such things either becaus...</td>\n",
       "      <td>1008</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       predicted_proba  true_label  \\\n",
       "Id                                   \n",
       "7878              0.88       False   \n",
       "37550             0.75       False   \n",
       "\n",
       "                                                                                                                                                  Title  \\\n",
       "Id                                                                                                                                                        \n",
       "7878                                            When quoting a person's informal speech, how much liberty do you have to make changes to what they say?   \n",
       "37550  How to make sure that my reader has not forgotten an incident or character which was described earlier and referenced much later in the writing?   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 body_text  \\\n",
       "Id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           \n",
       "7878   Even during a formal interview for a news article, people speak informally. They say \"uhm\", they cut off sentences half-way through, they interject phrases like \"you know?\", and they make innocent grammatical mistakes.\\nAs somebody who wants to fairly and accurately report the discussion that takes place in an interview, what guidelines should I use in making changes to what a person says?\\nWhile the simplest solution is to write exactly what they say and [sic] any errors they make, that can...   \n",
       "37550  In the context of writing a fiction novel, how to make sure that readers remember several of the incidents or events or even characters that were described in earlier chapters, when the same gets referenced in the much later part? \\nI have also experienced this myself as a reader, especially in mystery novels where a minute detail or character described earlier ends up as a dramatic, concluding climax later. Reader gets a sense of being lost as they happen to forget such things either becaus...   \n",
       "\n",
       "       text_len  action_verb_full  question_mark_full  language_question  \n",
       "Id                                                                        \n",
       "7878        694              True                True              False  \n",
       "37550      1008              True                True              False  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "worst_neg[to_display]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And finally, the most \"unsure\" questions, the ones where a model's probability is closest to equal for all classes (`.5` in our case since we have two classes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>predicted_proba</th>\n",
       "      <th>true_label</th>\n",
       "      <th>Title</th>\n",
       "      <th>body_text</th>\n",
       "      <th>text_len</th>\n",
       "      <th>action_verb_full</th>\n",
       "      <th>question_mark_full</th>\n",
       "      <th>language_question</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>26444</th>\n",
       "      <td>0.5</td>\n",
       "      <td>False</td>\n",
       "      <td>How to make a choice more sadistic?</td>\n",
       "      <td>How do I create an impossible choice for my protagonist? I want to place him in a painful dilemma, and I'm having trouble making the choice feel truly impossible.\\nFor example, I'm writing a scene where store owner Ed learns of an upcoming sting operation at his store. Ed is friends with both the cop running the operation, and the customer who's the intended target, so I thought that could be developed into a good flashpoint, torn between two friends and dangerous circumstances.\\nThe problem...</td>\n",
       "      <td>1065</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10263</th>\n",
       "      <td>0.5</td>\n",
       "      <td>False</td>\n",
       "      <td>How to define a research question when the output is a product, used in research?</td>\n",
       "      <td>My school requests a research question for a project. I came up with one like this: (different topic but very similar, due to NDA)\\n\\nWhat is the behavior of a bike, depending on speed, time in use and temperature?\\n\\nThis question is about the research, and defines what is wanted. However this is way too much for the time I will work at the project. My task is to design, build and verify the bike.\\nIf I leave the research question as it is, I will be having trouble at the end, because I can...</td>\n",
       "      <td>1345</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       predicted_proba  true_label  \\\n",
       "Id                                   \n",
       "26444              0.5       False   \n",
       "10263              0.5       False   \n",
       "\n",
       "                                                                                   Title  \\\n",
       "Id                                                                                         \n",
       "26444                                                How to make a choice more sadistic?   \n",
       "10263  How to define a research question when the output is a product, used in research?   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 body_text  \\\n",
       "Id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           \n",
       "26444  How do I create an impossible choice for my protagonist? I want to place him in a painful dilemma, and I'm having trouble making the choice feel truly impossible.\\nFor example, I'm writing a scene where store owner Ed learns of an upcoming sting operation at his store. Ed is friends with both the cop running the operation, and the customer who's the intended target, so I thought that could be developed into a good flashpoint, torn between two friends and dangerous circumstances.\\nThe problem...   \n",
       "10263  My school requests a research question for a project. I came up with one like this: (different topic but very similar, due to NDA)\\n\\nWhat is the behavior of a bike, depending on speed, time in use and temperature?\\n\\nThis question is about the research, and defines what is wanted. However this is way too much for the time I will work at the project. My task is to design, build and verify the bike.\\nIf I leave the research question as it is, I will be having trouble at the end, because I can...   \n",
       "\n",
       "       text_len  action_verb_full  question_mark_full  language_question  \n",
       "Id                                                                        \n",
       "26444      1065              True                True              False  \n",
       "10263      1345              True                True              False  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unsure[to_display]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To find new candidate features, I recommend combining the top-k method with feature importance and vectorization."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ml_editor",
   "language": "python",
   "name": "ml_editor"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
